Univariate analysis – data types and description of the independent attributes, which should include: name, range of values observed, central values (mean and median), standard deviation, quartiles, analysis of the body and tails of the distributions, missing values, outliers, and duplicates. (10 marks)
Bivariate analysis between the predictor variables, and between the predictor variables and the target column. Comment on your findings in terms of their relationship and degree of relation, if any. Visualize the analysis using boxplots and pair plots, histograms, or density curves. (10 marks)
Feature engineering techniques (10 marks): identify opportunities (if any) to extract new features from existing features, or drop a feature if required. Hint: feature extraction. For example, consider a dataset with two features, length and breadth; from these we can extract a new feature, Area = length * breadth.
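As a concrete-flavoured illustration of the hint above (a minimal sketch: the column names match the dataset used later, but the two mix records are made up), the water-to-cement ratio is a classic engineered feature, since it is a well-known driver of concrete strength:

```python
import pandas as pd

# Hypothetical mix records; 'cement' and 'water' follow the concrete dataset's column names.
df = pd.DataFrame({'cement': [540.0, 332.5], 'water': [162.0, 228.0]})

# Derive the water-to-cement ratio as a new feature from two existing ones.
df['water_cement_ratio'] = df['water'] / df['cement']
print(df['water_cement_ratio'].round(3).tolist())  # [0.3, 0.686]
```

The same pattern (one line of column arithmetic) applies to any ratio or product of existing features worth testing against the target.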
Get the data ready for modelling and do a train-test split.
Decide on the complexity of the model: should it be a simple linear model in terms of parameters, or would a quadratic or higher-degree model be more appropriate?
Algorithms that you think will be suitable for this project. Use K-fold cross-validation to evaluate model performance. Use appropriate metrics and make a DataFrame to compare the models w.r.t. their metrics (at least 3 algorithms; one bagging-based and one boosting-based algorithm must be included). (15 marks)
Techniques employed to squeeze extra performance out of the model without making it overfit. Use grid search or random search on any two of the models used above. Make a DataFrame to compare the models and their metrics after hyperparameter tuning, as above. (15 marks)
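A minimal sketch of the requested comparison, using a synthetic regression problem as a stand-in for the concrete data (swap in the notebook's X and y in practice); the model choices and the `Mean CV R2` column name are illustrative:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the concrete data (8 predictors, like the real dataset).
X, y = make_regression(n_samples=200, n_features=8, n_informative=8,
                       noise=10, random_state=1)

models = {
    'DecisionTree': DecisionTreeRegressor(random_state=1),
    'RandomForest (bagging)': RandomForestRegressor(n_estimators=50, random_state=1),
    'GradientBoosting (boosting)': GradientBoostingRegressor(random_state=1),
}
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
# One row per model: mean R2 across the 5 folds.
rows = [{'Model': name,
         'Mean CV R2': cross_val_score(m, X, y, cv=kfold, scoring='r2').mean()}
        for name, m in models.items()]
cv_results = pd.DataFrame(rows).sort_values('Mean CV R2', ascending=False)
print(cv_results)
```

The resulting DataFrame gives the side-by-side metric comparison the rubric asks for.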
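A hedged sketch of grid search on one of the models above (the parameter grid and the synthetic data are illustrative, not tuned values):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the concrete data; use the notebook's X, y in practice.
X, y = make_regression(n_samples=200, n_features=8, n_informative=8,
                       noise=10, random_state=1)

# Small illustrative grid; a real search would cover more values.
param_grid = {'n_estimators': [50, 100], 'max_depth': [None, 5]}
grid = GridSearchCV(RandomForestRegressor(random_state=1),
                    param_grid, cv=3, scoring='r2')
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

`grid.best_estimator_` can then be evaluated on the held-out test set and its metrics appended to the comparison DataFrame.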
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
#importing seaborn for statistical plots
import seaborn as sns
#importing plotting libraries
import matplotlib.pyplot as plt
#styling figures
plt.rc('font',size=14)
sns.set(style='white')
sns.set(style='whitegrid',color_codes=True)
#To enable plotting graphs in Jupyter notebook
%matplotlib inline
#importing the feature scaling library
from sklearn.preprocessing import StandardScaler
#Import Sklearn package's data splitting function which is based on random function
from sklearn.model_selection import train_test_split
# Import Linear Regression, Ridge and Lasso machine learning library
from sklearn.linear_model import LinearRegression, Ridge, Lasso
# Import KNN Regressor machine learning library
from sklearn.neighbors import KNeighborsRegressor
# Import Decision Tree Regressor machine learning library
from sklearn.tree import DecisionTreeRegressor
# Import ensemble machine learning library
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,AdaBoostRegressor,BaggingRegressor)
# Import support vector regressor machine learning library
from sklearn.svm import SVR
#Import the metrics
from sklearn import metrics
#Import the Voting regressor for Ensemble
from sklearn.ensemble import VotingRegressor
# Import stats from scipy
from scipy import stats
# Import zscore for scaling
from scipy.stats import zscore
#importing the metrics
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
#importing the K fold
from sklearn.model_selection import KFold
#importing the cross validation score
from sklearn.model_selection import cross_val_score
#importing the preprocessing library
from sklearn import preprocessing
# importing the Polynomial features
from sklearn.preprocessing import PolynomialFeatures
#importing kmeans clustering library
from sklearn.cluster import KMeans
from sklearn.utils import resample
concrete_df=pd.read_csv('concrete.csv')
#Check the first five records
concrete_df.head()
#Check the last few records
concrete_df.tail()
#Info of the dataset
concrete_df.info()
It gives details about the number of rows (1030) and columns (9), and the data types: except age, which is an integer, all columns are floats. Memory usage is 72.5 KB, and there are no null values in the data.
# Data type of the columns
concrete_df.dtypes
It gives the data types of each column of the dataset.
#To get the shape
concrete_df.shape
It gives the number of rows and columns in the dataset: there are 1030 rows and 9 columns.
#To get the columns name
concrete_df.columns
It gives the column names of the dataset.
# Five point summary
concrete_df.describe().T
We can see that cement, slag, and ash are right (positively) skewed.
#Creating Profile Report for Analysis
#!pip install pandas_profiling
import pandas_profiling
concrete_df.profile_report()
import itertools
cols = [i for i in concrete_df.columns if i != 'strength']
fig = plt.figure(figsize=(15, 20))
for i, j in itertools.zip_longest(cols, range(len(cols))):
    plt.subplot(4, 2, j + 1)
    ax = sns.distplot(concrete_df[i], color='green', rug=True)
    plt.axvline(concrete_df[i].mean(), linestyle="dashed", label="mean", color='black')
    plt.legend()
    plt.title(i)
    plt.xlabel("")
plt.figure(figsize= (20,15))
plt.subplot(3,3,1)
sns.boxplot(x= concrete_df.age, color='green')
plt.subplot(3,3,2)
sns.boxplot(x= concrete_df.ash, color='green')
plt.subplot(3,3,3)
sns.boxplot(x= concrete_df.cement, color='green')
plt.show()
plt.figure(figsize= (20,15))
plt.subplot(4,4,1)
sns.boxplot(x= concrete_df.coarseagg, color='green')
plt.subplot(4,4,2)
sns.boxplot(x= concrete_df.fineagg, color='green')
plt.subplot(4,4,3)
sns.boxplot(x= concrete_df.strength, color='green')
plt.subplot(4,4,4)
sns.boxplot(x= concrete_df.superplastic, color='green')
plt.show()
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(14,11))
ax = fig.add_subplot(111, projection="3d")  # fig.gca(projection=...) was removed in newer matplotlib
plot = ax.scatter(concrete_df["cement"],
concrete_df["strength"],
concrete_df["superplastic"],
linewidth=1,edgecolor ="k",
c=concrete_df["age"],s=100,cmap="cool")
ax.set_xlabel("cement")
ax.set_ylabel("strength")
ax.set_zlabel("superplastic")
lab = fig.colorbar(plot,shrink=.5,aspect=5)
lab.set_label("AGE",fontsize = 15)
plt.title("3D plot for cement,strength and super plastic",color="navy")
plt.show()
plt.subplots(figsize=(12, 6))
ax = sns.boxplot(data=concrete_df)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45);
from scipy.stats import zscore
import scipy.stats as stats
#Let's check the skew in all numerical attributes
skew_cols = ['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg', 'fineagg', 'age', 'strength']
Skewness = pd.DataFrame({'Skewness': [stats.skew(concrete_df[c]) for c in skew_cols]},
                        index=skew_cols)  # Measure the skewness of the required columns
Skewness
print('Range of values: ', concrete_df['cement'].max()-concrete_df['cement'].min())
#Central values
print('Minimum cement: ', concrete_df['cement'].min())
print('Maximum cement: ',concrete_df['cement'].max())
print('Mean value: ', concrete_df['cement'].mean())
print('Median value: ',concrete_df['cement'].median())
print('Standard deviation: ', concrete_df['cement'].std())
#Quartiles
Q1=concrete_df['cement'].quantile(q=0.25)
Q3=concrete_df['cement'].quantile(q=0.75)
print('1st Quartile (Q1) is: ', Q1)
print('3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) is ', stats.iqr(concrete_df['cement']))
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in cement: ', L_outliers)
print('Upper outliers in cement: ', U_outliers)
print('Number of outliers in cement upper : ', concrete_df[concrete_df['cement']>U_outliers]['cement'].count())
print('Number of outliers in cement lower : ', concrete_df[concrete_df['cement']<L_outliers]['cement'].count())
print('% of Outlier in cement upper: ',round(concrete_df[concrete_df['cement']>U_outliers]['cement'].count()*100/len(concrete_df)), '%')
print('% of Outlier in cement lower: ',round(concrete_df[concrete_df['cement']<L_outliers]['cement'].count()*100/len(concrete_df)), '%')
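The per-column block above can be wrapped in a small helper so the IQR fences are computed rather than hard-coded (a sketch; `iqr_outlier_summary` is a name introduced here, not from the original notebook):

```python
import pandas as pd

def iqr_outlier_summary(s: pd.Series) -> dict:
    """Return the IQR fences and outlier counts for a numeric Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {'lower_fence': low, 'upper_fence': high,
            'n_low': int((s < low).sum()), 'n_high': int((s > high).sum())}

# Small worked example with one obvious high outlier.
s = pd.Series([10, 12, 11, 13, 12, 100])
print(iqr_outlier_summary(s))
# {'lower_fence': 9.0, 'upper_fence': 15.0, 'n_low': 0, 'n_high': 1}
```

Calling `iqr_outlier_summary(concrete_df[col])` per column would reproduce the counts printed in each section.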
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='cement',data=concrete_df,orient='v',ax=ax1,color='green')
ax1.set_ylabel('Cement', fontsize=15)
ax1.set_title('Distribution of cement', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(concrete_df['cement'],ax=ax2,color='green')
ax2.set_xlabel('Cement', fontsize=15)
ax2.set_ylabel('Density', fontsize=15)
ax2.set_title('Distribution of cement', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(concrete_df['cement'],color='green')
ax3.set_xlabel('Cement', fontsize=15)
ax3.set_ylabel('Frequency', fontsize=15)
ax3.set_title('Histogram of cement', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
print('Range of values: ', concrete_df['slag'].max()-concrete_df['slag'].min())
print('Minimum slag: ', concrete_df['slag'].min())
print('Maximum slag: ',concrete_df['slag'].max())
print('Mean value: ', concrete_df['slag'].mean())
print('Median value: ',concrete_df['slag'].median())
print('Standard deviation: ', concrete_df['slag'].std())
print('Null values: ',concrete_df['slag'].isnull().any())
Q1=concrete_df['slag'].quantile(q=0.25)
Q3=concrete_df['slag'].quantile(q=0.75)
print('1st Quartile (Q1) is: ', Q1)
print('3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) is ', stats.iqr(concrete_df['slag']))
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in slag: ', L_outliers)
print('Upper outliers in slag: ', U_outliers)
print('Number of outliers in slag upper : ', concrete_df[concrete_df['slag']>U_outliers]['slag'].count())
print('Number of outliers in slag lower : ', concrete_df[concrete_df['slag']<L_outliers]['slag'].count())
print('% of Outlier in slag upper: ',round(concrete_df[concrete_df['slag']>U_outliers]['slag'].count()*100/len(concrete_df)), '%')
print('% of Outlier in slag lower: ',round(concrete_df[concrete_df['slag']<L_outliers]['slag'].count()*100/len(concrete_df)), '%')
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='slag',data=concrete_df,orient='v',ax=ax1,color='green')
ax1.set_ylabel('Slag', fontsize=15)
ax1.set_title('Distribution of slag', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(concrete_df['slag'],ax=ax2,color='green')
ax2.set_xlabel('Slag', fontsize=15)
ax2.set_ylabel('Density', fontsize=15)
ax2.set_title('Distribution of slag', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(concrete_df['slag'],color='green')
ax3.set_xlabel('Slag', fontsize=15)
ax3.set_ylabel('Frequency', fontsize=15)
ax3.set_title('Histogram of slag', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
print('Range of values: ', concrete_df['ash'].max()-concrete_df['ash'].min())
print('Minimum ash: ', concrete_df['ash'].min())
print('Maximum ash: ',concrete_df['ash'].max())
print('Mean value: ', concrete_df['ash'].mean())
print('Median value: ',concrete_df['ash'].median())
print('Standard deviation: ', concrete_df['ash'].std())
Q1=concrete_df['ash'].quantile(q=0.25)
Q3=concrete_df['ash'].quantile(q=0.75)
print('1st Quartile (Q1) is: ', Q1)
print('3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) is ', stats.iqr(concrete_df['ash']))
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in ash: ', L_outliers)
print('Upper outliers in ash: ', U_outliers)
print('Number of outliers in ash upper : ', concrete_df[concrete_df['ash']>U_outliers]['ash'].count())
print('Number of outliers in ash lower : ', concrete_df[concrete_df['ash']<L_outliers]['ash'].count())
print('% of Outlier in ash upper: ',round(concrete_df[concrete_df['ash']>U_outliers]['ash'].count()*100/len(concrete_df)), '%')
print('% of Outlier in ash lower: ',round(concrete_df[concrete_df['ash']<L_outliers]['ash'].count()*100/len(concrete_df)), '%')
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='ash',data=concrete_df,orient='v',ax=ax1,color='green')
ax1.set_ylabel('Ash', fontsize=15)
ax1.set_title('Distribution of ash', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(concrete_df['ash'],ax=ax2,color='green')
ax2.set_xlabel('Ash', fontsize=15)
ax2.set_ylabel('Density', fontsize=15)
ax2.set_title('Distribution of ash', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(concrete_df['ash'],color='green')
ax3.set_xlabel('Ash', fontsize=15)
ax3.set_ylabel('Frequency', fontsize=15)
ax3.set_title('Histogram of ash', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
print('Range of values: ', concrete_df['water'].max()-concrete_df['water'].min())
#Central values
print('Minimum water: ', concrete_df['water'].min())
print('Maximum water: ',concrete_df['water'].max())
print('Mean value: ', concrete_df['water'].mean())
print('Median value: ',concrete_df['water'].median())
print('Standard deviation: ', concrete_df['water'].std())
print('Null values: ',concrete_df['water'].isnull().any())
#Quartiles
Q1=concrete_df['water'].quantile(q=0.25)
Q3=concrete_df['water'].quantile(q=0.75)
print('1st Quartile (Q1) is: ', Q1)
print('3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) is ', stats.iqr(concrete_df['water']))
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in water: ', L_outliers)
print('Upper outliers in water: ', U_outliers)
print('Number of outliers in water upper : ', concrete_df[concrete_df['water']>U_outliers]['water'].count())
print('Number of outliers in water lower : ', concrete_df[concrete_df['water']<L_outliers]['water'].count())
print('% of Outlier in water upper: ',round(concrete_df[concrete_df['water']>U_outliers]['water'].count()*100/len(concrete_df)), '%')
print('% of Outlier in water lower: ',round(concrete_df[concrete_df['water']<L_outliers]['water'].count()*100/len(concrete_df)), '%')
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='water',data=concrete_df,orient='v',ax=ax1,color='green')
ax1.set_ylabel('Water', fontsize=15)
ax1.set_title('Distribution of water', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(concrete_df['water'],ax=ax2,color='green')
ax2.set_xlabel('Water', fontsize=15)
ax2.set_ylabel('Density', fontsize=15)
ax2.set_title('Distribution of water', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(concrete_df['water'],color='green')
ax3.set_xlabel('Water', fontsize=15)
ax3.set_ylabel('Frequency', fontsize=15)
ax3.set_title('Histogram of water', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
print('Range of values: ', concrete_df['superplastic'].max()-concrete_df['superplastic'].min())
print('Minimum superplastic: ', concrete_df['superplastic'].min())
print('Maximum superplastic: ',concrete_df['superplastic'].max())
print('Mean value: ', concrete_df['superplastic'].mean())
print('Median value: ',concrete_df['superplastic'].median())
print('Standard deviation: ', concrete_df['superplastic'].std())
print('Null values: ',concrete_df['superplastic'].isnull().any())
Q1=concrete_df['superplastic'].quantile(q=0.25)
Q3=concrete_df['superplastic'].quantile(q=0.75)
print('1st Quartile (Q1) is: ', Q1)
print('3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) is ', stats.iqr(concrete_df['superplastic']))
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in superplastic: ', L_outliers)
print('Upper outliers in superplastic: ', U_outliers)
print('Number of outliers in superplastic upper : ', concrete_df[concrete_df['superplastic']>U_outliers]['superplastic'].count())
print('Number of outliers in superplastic lower : ', concrete_df[concrete_df['superplastic']<L_outliers]['superplastic'].count())
print('% of Outlier in superplastic upper: ',round(concrete_df[concrete_df['superplastic']>U_outliers]['superplastic'].count()*100/len(concrete_df)), '%')
print('% of Outlier in superplastic lower: ',round(concrete_df[concrete_df['superplastic']<L_outliers]['superplastic'].count()*100/len(concrete_df)), '%')
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='superplastic',data=concrete_df,orient='v',ax=ax1,color='green')
ax1.set_ylabel('Superplastic', fontsize=15)
ax1.set_title('Distribution of superplastic', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(concrete_df['superplastic'],ax=ax2,color='green')
ax2.set_xlabel('Superplastic', fontsize=15)
ax2.set_ylabel('Density', fontsize=15)
ax2.set_title('Distribution of superplastic', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(concrete_df['superplastic'],color='green')
ax3.set_xlabel('Superplastic', fontsize=15)
ax3.set_ylabel('Frequency', fontsize=15)
ax3.set_title('Histogram of superplastic', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
print('Range of values: ', concrete_df['coarseagg'].max()-concrete_df['coarseagg'].min())
print('Minimum value: ', concrete_df['coarseagg'].min())
print('Maximum value: ',concrete_df['coarseagg'].max())
print('Mean value: ', concrete_df['coarseagg'].mean())
print('Median value: ',concrete_df['coarseagg'].median())
print('Standard deviation: ', concrete_df['coarseagg'].std())
print('Null values: ',concrete_df['coarseagg'].isnull().any())
Q1=concrete_df['coarseagg'].quantile(q=0.25)
Q3=concrete_df['coarseagg'].quantile(q=0.75)
print('1st Quartile (Q1) is: ', Q1)
print('3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) is ', stats.iqr(concrete_df['coarseagg']))
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in coarseagg: ', L_outliers)
print('Upper outliers in coarseagg: ', U_outliers)
print('Number of outliers in coarseagg upper : ', concrete_df[concrete_df['coarseagg']>U_outliers]['coarseagg'].count())
print('Number of outliers in coarseagg lower : ', concrete_df[concrete_df['coarseagg']<L_outliers]['coarseagg'].count())
print('% of Outlier in coarseagg upper: ',round(concrete_df[concrete_df['coarseagg']>U_outliers]['coarseagg'].count()*100/len(concrete_df)), '%')
print('% of Outlier in coarseagg lower: ',round(concrete_df[concrete_df['coarseagg']<L_outliers]['coarseagg'].count()*100/len(concrete_df)), '%')
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='coarseagg',data=concrete_df,orient='v',ax=ax1,color='green')
ax1.set_ylabel('Coarseagg', fontsize=15)
ax1.set_title('Distribution of coarseagg', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(concrete_df['coarseagg'],ax=ax2,color='green')
ax2.set_xlabel('Coarseagg', fontsize=15)
ax2.set_ylabel('Density', fontsize=15)
ax2.set_title('Distribution of coarseagg', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(concrete_df['coarseagg'],color='green')
ax3.set_xlabel('Coarseagg', fontsize=15)
ax3.set_ylabel('Frequency', fontsize=15)
ax3.set_title('Histogram of coarseagg', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
print('Range of values: ', concrete_df['fineagg'].max()-concrete_df['fineagg'].min())
print('Minimum value: ', concrete_df['fineagg'].min())
print('Maximum value: ',concrete_df['fineagg'].max())
print('Mean value: ', concrete_df['fineagg'].mean())
print('Median value: ',concrete_df['fineagg'].median())
print('Standard deviation: ', concrete_df['fineagg'].std())
print('Null values: ',concrete_df['fineagg'].isnull().any())
Q1=concrete_df['fineagg'].quantile(q=0.25)
Q3=concrete_df['fineagg'].quantile(q=0.75)
print('1st Quartile (Q1) is: ', Q1)
print('3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) is ', stats.iqr(concrete_df['fineagg']))
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in fineagg: ', L_outliers)
print('Upper outliers in fineagg: ', U_outliers)
print('Number of outliers in fineagg upper : ', concrete_df[concrete_df['fineagg']>U_outliers]['fineagg'].count())
print('Number of outliers in fineagg lower : ', concrete_df[concrete_df['fineagg']<L_outliers]['fineagg'].count())
print('% of Outlier in fineagg upper: ',round(concrete_df[concrete_df['fineagg']>U_outliers]['fineagg'].count()*100/len(concrete_df)), '%')
print('% of Outlier in fineagg lower: ',round(concrete_df[concrete_df['fineagg']<L_outliers]['fineagg'].count()*100/len(concrete_df)), '%')
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='fineagg',data=concrete_df,orient='v',ax=ax1,color='green')
ax1.set_ylabel('Fineagg', fontsize=15)
ax1.set_title('Distribution of fineagg', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(concrete_df['fineagg'],ax=ax2,color='green')
ax2.set_xlabel('Fineagg', fontsize=15)
ax2.set_ylabel('Density', fontsize=15)
ax2.set_title('Distribution of fineagg', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(concrete_df['fineagg'],color='green')
ax3.set_xlabel('Fineagg', fontsize=15)
ax3.set_ylabel('Frequency', fontsize=15)
ax3.set_title('Histogram of fineagg', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
print('Range of values: ', concrete_df['age'].max()-concrete_df['age'].min())
print('Minimum age: ', concrete_df['age'].min())
print('Maximum age: ',concrete_df['age'].max())
print('Mean value: ', concrete_df['age'].mean())
print('Median value: ',concrete_df['age'].median())
print('Standard deviation: ', concrete_df['age'].std())
print('Null values: ',concrete_df['age'].isnull().any())
Q1=concrete_df['age'].quantile(q=0.25)
Q3=concrete_df['age'].quantile(q=0.75)
print('1st Quartile (Q1) is: ', Q1)
print('3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) is ', stats.iqr(concrete_df['age']))
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in age: ', L_outliers)
print('Upper outliers in age: ', U_outliers)
print('Number of outliers in age upper : ', concrete_df[concrete_df['age']>U_outliers]['age'].count())
print('Number of outliers in age lower : ', concrete_df[concrete_df['age']<L_outliers]['age'].count())
print('% of Outlier in age upper: ',round(concrete_df[concrete_df['age']>U_outliers]['age'].count()*100/len(concrete_df)), '%')
print('% of Outlier in age lower: ',round(concrete_df[concrete_df['age']<L_outliers]['age'].count()*100/len(concrete_df)), '%')
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='age',data=concrete_df,orient='v',ax=ax1,color='green')
ax1.set_ylabel('Age', fontsize=15)
ax1.set_title('Distribution of age', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(concrete_df['age'],ax=ax2,color='green')
ax2.set_xlabel('Age', fontsize=15)
ax2.set_ylabel('Density', fontsize=15)
ax2.set_title('Distribution of age', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(concrete_df['age'],color='green')
ax3.set_xlabel('Age', fontsize=15)
ax3.set_ylabel('Frequency', fontsize=15)
ax3.set_title('Histogram of age', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
scatter_age_balance = concrete_df.plot.scatter('cement','strength',figsize = (15,10),color='green')
plt.title('Cement and strength')
plt.show()
scatter_age_balance = concrete_df.plot.scatter('superplastic','strength',figsize = (15,10),color='green')
plt.title('Superplastic and strength')
plt.show()
# Distplot
fig, ax2 = plt.subplots(3, 3, figsize=(16, 16))
sns.distplot(concrete_df['cement'],ax=ax2[0][0],color='green')
sns.distplot(concrete_df['slag'],ax=ax2[0][1],color='green')
sns.distplot(concrete_df['ash'],ax=ax2[0][2],color='green')
sns.distplot(concrete_df['water'],ax=ax2[1][0],color='green')
sns.distplot(concrete_df['superplastic'],ax=ax2[1][1],color='green')
sns.distplot(concrete_df['coarseagg'],ax=ax2[1][2],color='green')
sns.distplot(concrete_df['fineagg'],ax=ax2[2][0],color='green')
sns.distplot(concrete_df['age'],ax=ax2[2][1],color='green')
sns.distplot(concrete_df['strength'],ax=ax2[2][2],color='green')
Observation
The density plots above show that several attributes (cement, slag, ash, superplastic and especially age) are right-skewed rather than normally distributed.
# Histogram
concrete_df.hist(figsize=(15,15),color='green')
## pairplot- plot density curve instead of histogram in diagonal
sns.pairplot(concrete_df, diag_kind='kde')
corr = concrete_df.corr()
plt.figure(figsize = (10,8))
sns.heatmap(corr, cmap = 'Blues', annot = True)
plt.title('Pearson Correlation Coefficients')
plt.show()
corr_sorted = corr.unstack().sort_values(kind='quicksort', ascending = False)
print(corr_sorted[corr_sorted!=1].head(10))
print(corr_sorted[corr_sorted!=1].tail(10))
fig, ax = plt.subplots(figsize=(10,8))
sns.scatterplot(y="strength", x="cement", hue="water", size="age", data=concrete_df, ax=ax, sizes=(50, 300),
palette='RdYlBu', alpha=0.9)
ax.set_title("Strength vs Cement, Age, Water")
ax.legend()
plt.show()
Strength correlates positively with cement.
Strength correlates positively with age, though less strongly than with cement.
Older samples tend to contain more water, as shown by the larger data points.
Strength correlates negatively with water.
Achieving high strength at a low age requires more cement.
fig, ax = plt.subplots(figsize=(10,8))
sns.scatterplot(y="strength", x="fineagg", hue="ash", size="superplastic", data=concrete_df, ax=ax, sizes=(50, 300),
palette='RdYlBu', alpha=0.9)
ax.set_title("Strength vs fineagg, ash, superplastic")
ax.legend(loc="upper left", bbox_to_anchor=(1,1)) # Moved outside the chart so it doesn't cover any data
plt.show()
Strength correlates negatively with fly ash.
Strength correlates positively with superplastic.
cement vs other independent attributes: cement does not have any significant relationship with slag, ash, water, superplastic, coarseagg, fineagg or age; the points spread like a cloud.
slag vs other independent attributes: slag likewise shows no significant relationship with ash, water, superplastic, coarseagg, fineagg or age.
ash vs other independent attributes: ash shows no significant relationship with water, superplastic, coarseagg, fineagg or age.
water vs other independent attributes: water has a negative linear relationship with superplastic and fineagg.
strength vs cement: strength is positively, roughly linearly related to cement, but for a given value of cement we see multiple values of strength. So although cement has a positive relationship with strength, it is only a weak predictor on its own.
strength vs slag: there is no particular trend.
strength vs ash: there is no particular trend either.
strength vs age: for a given value of age we have many different values of strength, so age alone is not a good predictor.
strength vs superplastic: similarly, for a given value of superplastic we have many different values of strength, so it alone is not a good predictor.
The other attributes do not show any strong relationship with strength.
Hence none of the independent attributes is individually a good linear predictor of strength, which suggests a simple linear model may not be sufficient.
# correlation matrix
cor=concrete_df.corr()
cor
Here we can see the pairwise correlation values between the attributes.
#Check for the missing values
concrete_df.isnull().sum()
#Checking for outliers
concrete_df1=concrete_df.copy()
concrete_df1.boxplot(figsize=(35,15))
#Number of outliers present in the dataset
print('Number of outliers in cement: ',concrete_df1[((concrete_df1.cement - concrete_df1.cement.mean()) / concrete_df1.cement.std()).abs() >3]['cement'].count())
print('Number of outliers in slag: ',concrete_df1[((concrete_df1.slag - concrete_df1.slag.mean()) / concrete_df1.slag.std()).abs() >3]['slag'].count())
print('Number of outliers in ash: ',concrete_df1[((concrete_df1.ash - concrete_df1.ash.mean()) / concrete_df1.ash.std()).abs() >3]['ash'].count())
print('Number of outliers in water: ',concrete_df1[((concrete_df1.water - concrete_df1.water.mean()) / concrete_df1.water.std()).abs() >3]['water'].count())
print('Number of outliers in superplastic: ',concrete_df1[((concrete_df1.superplastic - concrete_df1.superplastic.mean()) / concrete_df1.superplastic.std()).abs() >3]['superplastic'].count())
print('Number of outliers in coarseagg: ',concrete_df1[((concrete_df1.coarseagg - concrete_df1.coarseagg.mean()) / concrete_df1.coarseagg.std()).abs() >3]['coarseagg'].count())
print('Number of outliers in fineagg: ',concrete_df1[((concrete_df1.fineagg - concrete_df1.fineagg.mean()) / concrete_df1.fineagg.std()).abs() >3]['fineagg'].count())
print('Number of outliers in age: ',concrete_df1[((concrete_df1.age - concrete_df1.age.mean()) / concrete_df1.age.std()).abs() >3]['age'].count())
print('Records containing outliers in slag: \n',concrete_df1[((concrete_df1.slag - concrete_df1.slag.mean()) / concrete_df1.slag.std()).abs() >3]['slag'])
print('Records containing outliers in water: \n',concrete_df1[((concrete_df1.water - concrete_df1.water.mean()) / concrete_df1.water.std()).abs() >3]['water'])
print('Records containing outliers in superplastic: \n',concrete_df1[((concrete_df1.superplastic - concrete_df1.superplastic.mean()) / concrete_df1.superplastic.std()).abs() >3]['superplastic'])
print('Records containing outliers in age: \n',concrete_df1[((concrete_df1.age - concrete_df1.age.mean()) / concrete_df1.age.std()).abs() >3]['age'])
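The repeated per-column z-score counts above can be produced with a single loop; a self-contained sketch on synthetic data (the planted value 50 stands in for a real outlier, and the column names are made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for concrete_df1: two numeric columns, one planted outlier.
rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.normal(size=100), 'b': rng.normal(size=100)})
df.loc[0, 'a'] = 50  # plant one extreme value

# Same |z| > 3 rule as the prints above, applied per column in one loop.
for col in df.columns:
    z = (df[col] - df[col].mean()) / df[col].std()
    print(col, int((z.abs() > 3).sum()))
```

Running the loop over `concrete_df1.columns` would reproduce the eight counts without repeating the expression.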
#Handling the outliers
#Replacing the outliers by median
for col_name in concrete_df1.columns[:-1]:
    q1 = concrete_df1[col_name].quantile(0.25)
    q3 = concrete_df1[col_name].quantile(0.75)
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr
    concrete_df1.loc[(concrete_df1[col_name] < low) | (concrete_df1[col_name] > high), col_name] = concrete_df1[col_name].median()
concrete_df1.boxplot(figsize=(35,15))
#Scaling the features
concrete_df_z = concrete_df1.apply(zscore)
concrete_df_z=pd.DataFrame(concrete_df_z,columns=concrete_df.columns)
The attributes are not all on the same scale (age in particular differs from the component quantities), so we standardize them before modelling; here we use the z-score.
#Splitting the data into independent and dependent attributes
#independent and dependent variables
X=concrete_df_z.iloc[:,0:8]
y = concrete_df_z.iloc[:,8]
# compare the effect of the degree on the number of created features
from sklearn.preprocessing import PolynomialFeatures
from matplotlib import pyplot
# calculate change in number of features
num_features = list()
degrees = [i for i in range(1, 6)]
for d in degrees:
    # create transform
    trans = PolynomialFeatures(degree=d)
    # fit and transform
    data = trans.fit_transform(X)
    # record number of features
    num_features.append(data.shape[1])
    # summarize
    print('Degree: %d, Features: %d' % (d, data.shape[1]))
# plot degree vs number of features
pyplot.plot(degrees, num_features)
pyplot.show()
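The feature counts printed above follow a closed form: with the default include_bias=True, PolynomialFeatures on n inputs at degree d generates C(n + d, d) output terms (bias, linear, powers, and interactions). A quick check for the 8 predictors here:

```python
from math import comb

n = 8  # number of predictors in the concrete data
for d in range(1, 6):
    # C(n + d, d) counts all monomials of total degree <= d, including the bias term.
    print(d, comb(n + d, d))
# 1 -> 9, 2 -> 45, 3 -> 165, 4 -> 495, 5 -> 1287
```

This makes clear how quickly the feature count grows with the degree, which motivates keeping the polynomial degree low.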
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 1)
poly = PolynomialFeatures(degree = 2, interaction_only=False, include_bias=False)
# Importing models
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Linear Regression
lr = LinearRegression()
pipe = Pipeline([('polynomial_features',poly), ('linear_regression',lr)])
# Ridge Regression
ridge = Ridge()
piperidge = Pipeline([('polynomial_features',poly), ('ridge_regression',ridge)])
# Fitting models on Training data
pipe.fit(X_train, y_train)
piperidge.fit(X_train, y_train)
# Making predictions on Test data
y_pred_lr = pipe.predict(X_test)
y_pred_ridge = piperidge.predict(X_test)
# performance on train data
print('Performance on training data using LR:',pipe.score(X_train,y_train))
# performance on test data
print('Performance on testing data using LR:',pipe.score(X_test,y_test))
#Evaluate the model using the R-squared score
acc_LR=metrics.r2_score(y_test, y_pred_lr)
print('R2 score LR: ',acc_LR)
#Store the R2 results for each model in a dataframe for final comparison
results = pd.DataFrame({'Method':['LR'], 'accuracy': acc_LR},index=['1'])
results = results[['Method', 'accuracy']]
results
# performance on train data
print('Performance on training data using ridge:',piperidge.score(X_train,y_train))
# performance on test data
print('Performance on testing data using ridge:',piperidge.score(X_test,y_test))
#Evaluate the model using R-squared
acc_ridge = metrics.r2_score(y_test, y_pred_ridge)
print('R-squared ridge: ', acc_ridge)
#Store the results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method': ['ridge'], 'accuracy': [acc_ridge]}, index=['1'])
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results
#Adding into Pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
clf_pipeline = Pipeline(steps=[('polynomial_features', poly),
                               ('regressor', DecisionTreeRegressor())])
clf_pipeline.fit(X_train , y_train)
y_pred = clf_pipeline.predict(X_test)
# performance on train data
print('Performance on training data using DT:',clf_pipeline.score(X_train,y_train))
# performance on test data
print('Performance on testing data using DT:',clf_pipeline.score(X_test,y_test))
#Evaluate the model using R-squared
acc_DT = metrics.r2_score(y_test, y_pred)
print('R-squared DT: ', acc_DT)
from scipy.stats import pearsonr
sns.set(style="darkgrid", color_codes=True)
# stat_func was removed from recent seaborn versions; report the correlation separately
print('Pearson r: %.3f' % pearsonr(y_test, y_pred)[0])
with sns.axes_style("white"):
    sns.jointplot(x=y_test, y=y_pred, kind="reg", color="k");
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method': ['Decision Tree'], 'accuracy': [acc_DT]}, index=['1'])
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results
from sklearn.model_selection import KFold, cross_val_score
num_folds = 20
seed = 42
# random_state requires shuffle=True in recent scikit-learn
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
results1 = cross_val_score(clf_pipeline,X, y, cv=kfold)
accuracy = np.mean(abs(results1))
print('Average R-squared: ', accuracy)
print('Standard Deviation: ', results1.std())
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method': ['Decision Tree k fold'], 'accuracy': [accuracy]}, index=['2'])
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results
Drop the least significant variables
concrete_df_z.info()
#Create a copy of the dataset
concrete_df2=concrete_df_z.copy()
#independent and dependent variable
X = concrete_df2.drop( ['strength','ash','coarseagg','fineagg'] , axis=1)
y = concrete_df2['strength']
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 1)
dt_model = Pipeline(steps=[('polynomial_features',poly),
('classifier', DecisionTreeRegressor())])
dt_model.fit(X_train , y_train)
y_pred = dt_model.predict(X_test)
# performance on train data
print('Performance on training data using DT:',dt_model.score(X_train,y_train))
# performance on test data
print('Performance on testing data using DT:',dt_model.score(X_test,y_test))
#Evaluate the model using R-squared
acc_DT = metrics.r2_score(y_test, y_pred)
print('R-squared DT: ', acc_DT)
The large gap between the training and test scores indicates an overfit model.
from scipy.stats import pearsonr
sns.set(style="darkgrid", color_codes=True)
# stat_func was removed from recent seaborn versions; report the correlation separately
print('Pearson r: %.3f' % pearsonr(y_test, y_pred)[0])
with sns.axes_style("white"):
    sns.jointplot(x=y_test, y=y_pred, kind="reg", color="k");
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method': ['Decision Tree2'], 'accuracy': [acc_DT]}, index=['3'])
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results
#independent and dependent variables
X=concrete_df_z.iloc[:,0:8]
y = concrete_df_z.iloc[:,8]
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 1)
# Regularizing the decision tree regressor and fitting the model
reg_dt_model = DecisionTreeRegressor( max_depth = 4,random_state=1,min_samples_leaf=5)
reg_dt_model.fit(X_train, y_train)
print (pd.DataFrame(reg_dt_model.feature_importances_, columns = ["Imp"], index = X_train.columns))
Here, we can see that ash, coarseagg and fineagg are the least significant variables.
Visualizing the Regularized Tree
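The visualization announced here is not shown in the notebook. A minimal sketch using `sklearn.tree.plot_tree` could look like the following; it uses synthetic stand-in data so it is self-contained, while in the notebook one would pass the fitted `reg_dt_model` and `X_train.columns` instead:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs in scripts too
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor, plot_tree

# Synthetic stand-in for the concrete data (3 features, 200 rows)
rng = np.random.RandomState(1)
X_demo = rng.rand(200, 3)
y_demo = X_demo[:, 0] * 10 + X_demo[:, 1] ** 2 + rng.randn(200) * 0.1

# Same regularization as reg_dt_model above: shallow depth, minimum leaf size
tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=5, random_state=1)
tree.fit(X_demo, y_demo)

fig, ax = plt.subplots(figsize=(16, 8))
plot_tree(tree, feature_names=['f0', 'f1', 'f2'], filled=True, fontsize=8, ax=ax)
fig.savefig('regularized_tree.png')
```

The shallow, pruned tree stays readable in a single figure, which is exactly why the regularized model (rather than the overfit one) is the one worth plotting.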
y_pred = reg_dt_model.predict(X_test)
# performance on train data
print('Performance on training data using DT:',reg_dt_model.score(X_train,y_train))
# performance on test data
print('Performance on testing data using DT:',reg_dt_model.score(X_test,y_test))
#Evaluate the model using R-squared
acc_RDT = metrics.r2_score(y_test, y_pred)
print('R-squared DT: ', acc_RDT)
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method': ['Pruned Decision Tree'], 'accuracy': [acc_RDT]}, index=['4'])
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results
num_folds = 20
seed = 42
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
results1 = cross_val_score(reg_dt_model,X, y, cv=kfold)
accuracy = np.mean(abs(results1))
print('Average R-squared: ', accuracy)
print('Standard Deviation: ', results1.std())
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method': ['Pruned Decision Tree k fold'], 'accuracy': [accuracy]}, index=['5'])
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results
#Create a copy of the dataset
concrete_df3=concrete_df_z.copy()
#independent and dependent variable
X = concrete_df3.drop( ['strength','ash','coarseagg','fineagg'] , axis=1)
y = concrete_df3['strength']
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 1)
X_train
# Regularizing the decision tree regressor and fitting the model
reg_dt_model = DecisionTreeRegressor( max_depth = 4,random_state=1,min_samples_leaf=5)
reg_dt_model.fit(X_train, y_train)
y_pred = reg_dt_model.predict(X_test)
# performance on train data
print('Performance on training data using DT:',reg_dt_model.score(X_train,y_train))
# performance on test data
print('Performance on testing data using DT:',reg_dt_model.score(X_test,y_test))
#Evaluate the model using R-squared
acc_RDT = metrics.r2_score(y_test, y_pred)
print('R-squared DT: ', acc_RDT)
print('MSE: ', metrics.mean_squared_error(y_test, y_pred))
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method': ['Pruned Decision Tree2'], 'accuracy': [acc_RDT]}, index=['6'])
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results
poly = PolynomialFeatures(degree=1, interaction_only=False, include_bias=False)
from sklearn.ensemble import RandomForestRegressor
model = Pipeline(steps=[('polynomial_features', poly),
                        ('regressor', RandomForestRegressor())])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# performance on train data
print('Performance on training data using RFR:',model.score(X_train,y_train))
# performance on test data
print('Performance on testing data using RFR:',model.score(X_test,y_test))
#Evaluate the model using R-squared
acc_RFR = metrics.r2_score(y_test, y_pred)
print('R-squared RFR: ', acc_RFR)
print('MSE: ', metrics.mean_squared_error(y_test, y_pred))
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method': ['Random Forest Regressor'], 'accuracy': [acc_RFR]}, index=['7'])
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results
num_folds = 20
seed = 77
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
results1 = cross_val_score(model,X, y, cv=kfold)
accuracy = np.mean(abs(results1))
print('Average R-squared: ', accuracy)
print('Standard Deviation: ', results1.std())
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method': ['Random Forest Regressor k fold'], 'accuracy': [accuracy]}, index=['8'])
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results
from sklearn.ensemble import GradientBoostingRegressor
model = Pipeline(steps=[('polynomial_features', poly),
                        ('regressor', GradientBoostingRegressor())])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# performance on train data
print('Performance on training data using GBR:',model.score(X_train,y_train))
# performance on test data
print('Performance on testing data using GBR:',model.score(X_test,y_test))
#Evaluate the model using R-squared
acc_GBR = metrics.r2_score(y_test, y_pred)
print('R-squared GBR: ', acc_GBR)
print('MSE: ', metrics.mean_squared_error(y_test, y_pred))
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method': ['Gradient Boost Regressor'], 'accuracy': [acc_GBR]}, index=['9'])
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results
num_folds = 20
seed = 77
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
results1 = cross_val_score(model,X, y, cv=kfold)
accuracy = np.mean(abs(results1))
print('Average R-squared: ', accuracy)
print('Standard Deviation: ', results1.std())
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method': ['Gradient Boost Regressor k fold'], 'accuracy': [accuracy]}, index=['10'])
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results
from sklearn.ensemble import BaggingRegressor
model = Pipeline(steps=[('polynomial_features', poly),
                        ('regressor', BaggingRegressor())])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# performance on train data
print('Performance on training data using Bagging:', model.score(X_train, y_train))
# performance on test data
print('Performance on testing data using Bagging:', model.score(X_test, y_test))
#Evaluate the model using R-squared
acc_BR = metrics.r2_score(y_test, y_pred)
print('R-squared Bagging: ', acc_BR)
print('MSE: ', metrics.mean_squared_error(y_test, y_pred))
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method': ['BaggingRegressor'], 'accuracy': [acc_BR]}, index=['11'])
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results
from sklearn.ensemble import AdaBoostRegressor
model = Pipeline(steps=[('polynomial_features', poly),
                        ('regressor', AdaBoostRegressor())])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# performance on train data
print('Performance on training data using AdaBoost:', model.score(X_train, y_train))
# performance on test data
print('Performance on testing data using AdaBoost:', model.score(X_test, y_test))
#Evaluate the model using R-squared
acc_ABR = metrics.r2_score(y_test, y_pred)
print('R-squared AdaBoost: ', acc_ABR)
print('MSE: ', metrics.mean_squared_error(y_test, y_pred))
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method': ['Ada Boosting Regressor'], 'accuracy': [acc_ABR]}, index=['12'])
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results
num_folds = 18
seed = 77
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
results1 = cross_val_score(model,X, y, cv=kfold)
accuracy = np.mean(abs(results1))
print('Average R-squared: ', accuracy)
print('Standard Deviation: ', results1.std())
tempResultsDf = pd.DataFrame({'Method': ['Ada Boosting Regressor k fold'], 'accuracy': [accuracy]}, index=['13'])
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results.sort_values(by='accuracy', ascending=False)
After fitting all the models, we can see that the Random Forest Regressor, Gradient Boost Regressor and Bagging Regressor (both on the hold-out set and under k-fold cross-validation) give better results than the other models.
Techniques employed to squeeze that extra performance out of the model without making it overfit. Use Grid Search or Random Search on any of the two models used above. Make a DataFrame to compare models after hyperparameter tuning and their metrics as above. (15 marks)
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
param_grid = {
'bootstrap': [True],
'max_depth': [10],
'max_features': ['log2'],
'min_samples_leaf': [1, 2, 3],
'min_samples_split': [5,10],
'n_estimators': np.arange(50, 71)
}
rfg = RandomForestRegressor(random_state = 7)
grid_search = GridSearchCV(estimator = rfg, param_grid = param_grid,
cv = kfold, n_jobs = 1, verbose = 0, return_train_score=True)
grid_search.fit(X_train, y_train.values.ravel());
grid_search.best_params_
# best ensemble model (with optimal combination of hyperparameters)
rfTree = grid_search.best_estimator_
rfTree.fit(X_train, y_train.values.ravel())
print('Random Forest Regressor')
rfTree_train_score = rfTree.score(X_train, y_train)
print("Random Forest Regressor Model Training Set Score:", rfTree_train_score)
rfTree_rmse = np.sqrt((-1) * cross_val_score(rfTree, X_train, y_train.values.ravel(), cv=kfold, scoring='neg_mean_squared_error').mean())
print("Random Forest Regressor Model RMSE :", rfTree_rmse)
rfTree_r2 = cross_val_score(rfTree, X_train, y_train.values.ravel(), cv=kfold, scoring='r2').mean()
print("Random Forest Regressor Model R-Square Value :", rfTree_r2)
rfTree_random_model_df = pd.DataFrame({'Trainng Score': [rfTree_train_score],
'RMSE': [rfTree_rmse],
'R Squared': [rfTree_r2]})
rfTree_random_model_df
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
param_grid={'n_estimators':[100,500],
'learning_rate': [0.1,0.05,0.02],
'max_depth':[4],
'min_samples_leaf':[3],
'max_features':[1.0] }
rfg = GradientBoostingRegressor(n_estimators=50)
grid_search = GridSearchCV(estimator = rfg, param_grid = param_grid,
cv = kfold, n_jobs = 1, verbose = 0, return_train_score=True)
grid_search.fit(X_train, y_train.values.ravel());
# best ensemble model (with optimal combination of hyperparameters)
rfTree = grid_search.best_estimator_
rfTree.fit(X_train, y_train.values.ravel())
print('Gradient Boosting Regressor')
rfTree_train_score = rfTree.score(X_train, y_train)
print("Gradient Boosting Model Training Set Score:", rfTree_train_score)
rfTree_rmse = np.sqrt((-1) * cross_val_score(rfTree, X_train, y_train.values.ravel(), cv=kfold, scoring='neg_mean_squared_error').mean())
print("Gradient Boosting Regressor Model RMSE :", rfTree_rmse)
rfTree_r2 = cross_val_score(rfTree, X_train, y_train.values.ravel(), cv=kfold, scoring='r2').mean()
print("Gradient Boosting Regressor Model R-Square Value :", rfTree_r2)
rfTree_random_model_gf = pd.DataFrame({'Trainng Score': [rfTree_train_score],
'RMSE': [rfTree_rmse],
'R Squared': [rfTree_r2]})
rfTree_random_model_gf
finalResults = pd.concat([rfTree_random_model_df, rfTree_random_model_gf])
finalResults
rf = RandomForestRegressor(random_state = 7)
print('Parameters currently in use:\n')
print(rf.get_params())
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=10, stop=100, num=3)] # 3 evenly spaced values
# Number of features to consider at every split
# ('auto' was removed in newer scikit-learn; for regressors 1.0 reproduces its old behaviour)
max_features = [1.0, 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 10, num=2)] # 2 evenly spaced values
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
print(random_grid)
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
n_iter = 5, scoring='neg_mean_absolute_error',
cv = kfold, verbose=2, random_state=7, n_jobs=-1,
return_train_score=True)
# Fit the random search model
rf_random.fit(X_train, y_train.values.ravel());
rfTree = rf_random.best_estimator_
rfTree.fit(X_train, y_train.values.ravel())
print('Random Forest Regressor')
rfTree_train_score = rfTree.score(X_train, y_train)
print("Random Forest Regressor Model Training Set Score:",rfTree_train_score)
rfTree_rmse = np.sqrt((-1) * cross_val_score(rfTree, X_train, y_train.values.ravel(), cv=kfold, scoring='neg_mean_squared_error').mean())
print("Random Forest Regressor Model RMSE :", rfTree_rmse)
rfTree_r2 = cross_val_score(rfTree, X_train, y_train.values.ravel(), cv=kfold, scoring='r2').mean()
print("Random Forest Regressor Model R-Square Value :", rfTree_r2)
rfTree_random_model_df_rs = pd.DataFrame({'Trainng Score': [rfTree_train_score],
'RMSE': [rfTree_rmse],
'R Squared': [rfTree_r2]})
rfTree_random_model_df_rs
rf = GradientBoostingRegressor(random_state = 50)
print('Parameters currently in use:\n')
print(rf.get_params())
from scipy.stats import uniform as sp_randFloat
from scipy.stats import randint as sp_randInt
# Create the random grid
parameters = {'learning_rate': sp_randFloat(),
'subsample' : sp_randFloat(),
'n_estimators' : sp_randInt(100, 1000),
'max_depth' : sp_randInt(4, 10)
}
print(parameters)
randm = RandomizedSearchCV(estimator=rf, param_distributions = parameters,
cv = 2, n_iter = 10, n_jobs=-1)
randm.fit(X_train, y_train)
print("\n The best parameters across ALL searched params:\n",
randm.best_params_)
rfTree = randm.best_estimator_
rfTree.fit(X_train, y_train.values.ravel())
print('Gradient Boosting Regressor')
rfTree_train_score = rfTree.score(X_train, y_train)
print("Gradient Boosting Regressor Model Training Set Score:", rfTree_train_score)
rfTree_rmse = np.sqrt((-1) * cross_val_score(rfTree, X_train, y_train.values.ravel(), cv=kfold, scoring='neg_mean_squared_error').mean())
print("Gradient Boosting Regressor Model RMSE :", rfTree_rmse)
rfTree_r2 = cross_val_score(rfTree, X_train, y_train.values.ravel(), cv=kfold, scoring='r2').mean()
print("Gradient Boosting Regressor Model R-Square Value :", rfTree_r2)
rfTree_random_model_gf_rs = pd.DataFrame({'Trainng Score': [rfTree_train_score],
'RMSE': [rfTree_rmse],
'R Squared': [rfTree_r2]})
rfTree_random_model_gf_rs
FINAL_RESULT = pd.concat([rfTree_random_model_df, rfTree_random_model_gf,rfTree_random_model_df_rs, rfTree_random_model_gf_rs])
FINAL_RESULT
methods = ['GridSearch-RandomForestRegressor', 'GridSearch-Gradient Boosting Regressor',
'RandomSearch-RandomForestRegressor',
'RandomSearch-Gradient Boosting Regressor']
FINAL_RESULT['method'] = methods
FINAL_RESULT.sort_values(by='R Squared', ascending=False)
concrete_XY = X.join(y)
values = concrete_XY.values
# Number of bootstrap samples to create
n_iterations = 1000
# size of a bootstrap sample
n_size = int(len(concrete_df_z) * 1)
# run bootstrap
from sklearn.utils import resample
# empty list that will hold the scores for each bootstrap iteration
stats = list()
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values, n_samples=n_size) # sampling with replacement
    test = np.array([x for x in values if x.tolist() not in train.tolist()]) # rows not drawn into the sample
    # fit model
    gbmTree = GradientBoostingRegressor(n_estimators=50)
    # fit against independent variables and corresponding target values
    gbmTree.fit(train[:, :-1], train[:, -1])
    # take the target column for all rows in the test set
    y_test = test[:, -1]
    # evaluate the model on the out-of-sample rows
    score = gbmTree.score(test[:, :-1], y_test)
    stats.append(score)
# plot scores
from matplotlib import pyplot
pyplot.hist(stats)
pyplot.show()
# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # lower-tail percentile: 2.5% in each tail
lower = max(0.0, np.percentile(stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
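The percentile method used above can be sanity-checked in isolation. This small sketch (with simulated scores; the names are illustrative) confirms that the 2.5th/97.5th percentiles bracket 95% of the bootstrap scores:

```python
import numpy as np

# Simulated bootstrap scores; the interval should bracket ~95% of them
rng = np.random.RandomState(0)
scores = rng.normal(loc=0.87, scale=0.02, size=1000)

alpha = 0.95
lower = np.percentile(scores, ((1.0 - alpha) / 2.0) * 100)          # 2.5th percentile
upper = np.percentile(scores, (alpha + (1.0 - alpha) / 2.0) * 100)  # 97.5th percentile

# Fraction of scores inside [lower, upper] is 95% by construction
inside = np.mean((scores >= lower) & (scores <= upper))
print(round(float(inside), 3))  # → 0.95
```

The same logic applied to the real `stats` list gives the model's 95% confidence interval, which is why the code above clamps the bounds to [0, 1] for a score metric.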
values = concrete_XY.values
# Number of bootstrap samples to create
n_iterations = 1000
# size of a bootstrap sample
n_size = int(len(concrete_df_z) * 1)
# run bootstrap
from sklearn.utils import resample
# empty list that will hold the scores for each bootstrap iteration
stats = list()
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values, n_samples=n_size) # sampling with replacement
    test = np.array([x for x in values if x.tolist() not in train.tolist()]) # rows not drawn into the sample
    # fit model
    rfTree = RandomForestRegressor(n_estimators=100)
    # fit against independent variables and corresponding target values
    rfTree.fit(train[:, :-1], train[:, -1])
    # take the target column for all rows in the test set
    y_test = test[:, -1]
    # evaluate the model on the out-of-sample rows
    score = rfTree.score(test[:, :-1], y_test)
    stats.append(score)
# plot scores
from matplotlib import pyplot
pyplot.hist(stats)
pyplot.show()
# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # lower-tail percentile: 2.5% in each tail
lower = max(0.0, np.percentile(stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
The bootstrap random forest regression model's performance lies between roughly 84% and 90.3% R-squared, which is better than the other algorithms evaluated.
# Helper classes
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score
from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np
class Regressor(object):
    """
    Class representing a regressor.
    Based on the parameters supplied in the constructor, this class constructs a pipeline object.
    The constructed pipeline adds
    - a StandardScaler if the scale parameter is passed as True
    - polynomial feature transformations if the include_polynomial flag is set as True
    """
    def __init__(self, name, model, scale=True, include_polynomial=False, degree=2):
        self.name = name
        self.model = model
        steps = []
        if scale:
            steps.append(('scaler', StandardScaler()))
        if include_polynomial:
            steps.append(('poly_features', PolynomialFeatures(degree=degree)))
        steps.append(('model', model))
        self.steps = steps

    def get_name(self):
        return self.name

    def get_model(self):
        return self.model

    def get(self):
        return Pipeline(steps=self.steps)

    def feature_imp(self):
        try:
            return self.model.feature_importances_
        except AttributeError:
            try:
                return self.model.coef_
            except AttributeError:
                return None
class ModelsBuilder(object):
    '''
    This class is responsible for building the models and constructing a results DataFrame.
    It accepts several Regressor objects.
    '''
    def __init__(self, regressors, data, target, test_size=0.3, seed=42):
        self.regressors = regressors
        self.split_data = train_test_split(data.drop(target, axis=1), data[target], test_size=test_size, random_state=seed)
        self.data = data
        self.target = target

    def build(self, k_fold_splits=10):
        results = pd.DataFrame(columns=['model', 'training_score', 'test_score', 'k_fold_mean', 'k_fold_std'])
        for regressor in self.regressors:
            pipeline = regressor.get()
            pipeline.fit(self.split_data[0], self.split_data[2])
            cross_vals = cross_val_score(regressor.get(), self.data.drop(self.target, axis=1), self.data[self.target], cv=KFold(n_splits=k_fold_splits))
            mean = round(cross_vals.mean(), 3)
            std = round(cross_vals.std(), 3)
            row = pd.DataFrame([{
                'model': regressor.get_name(),
                'training_score': round(pipeline.score(self.split_data[0], self.split_data[2]), 3),
                'test_score': round(pipeline.score(self.split_data[1], self.split_data[3]), 3),
                'k_fold_mean': mean,
                'k_fold_std': std,
                '95% confidence intervals': str(round(mean - (1.96 * std), 3)) + ' <-> ' + str(round(mean + (1.96 * std), 3))
            }])
            # DataFrame.append was removed in pandas 2.0; use pd.concat instead
            results = pd.concat([results, row], ignore_index=True)
        return results
class OutliersImputer(SimpleImputer):
    '''
    This class extends SimpleImputer to treat outliers: values beyond 1.5 * IQR
    are replaced with NaN and then imputed with the chosen strategy.
    '''
    def __init__(self, strategy='mean'):
        self.strategy = strategy
        super().__init__(strategy=strategy)

    def fit(self, X, y=None):
        for i in X.columns:
            q1, q2, q3 = X[i].quantile([0.25, 0.5, 0.75])
            IQR = q3 - q1
            a = X[i] > q3 + 1.5 * IQR
            b = X[i] < q1 - 1.5 * IQR
            # np.NaN was removed in NumPy 2.0; use np.nan
            X[i] = np.where(a | b, np.nan, X[i])
        return super().fit(X, y)
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
regressors = [
Regressor('Linear Regression', LinearRegression(), scale=True),
Regressor('Linear Regression degree 2', LinearRegression(),
scale=True, include_polynomial=True, degree=2),
Regressor('Linear Regression degree 3', LinearRegression(),
scale=True, include_polynomial=True, degree=3),
Regressor('Ridge', Ridge(random_state=42), scale=True),
Regressor('Ridge degree 2', Ridge(random_state=42),
scale=True, include_polynomial=True, degree=2),
Regressor('Ridge degree 3', Ridge(random_state=42),
scale=True, include_polynomial=True, degree=3),
Regressor('Lasso', Lasso(random_state=42), scale=True),
Regressor('Lasso degree 2', Lasso(random_state=42),
scale=True, include_polynomial=True, degree=2),
Regressor('Lasso degree 3', Lasso(random_state=42),
scale=True, include_polynomial=True, degree=3),
Regressor('Decision Tree', DecisionTreeRegressor(random_state=42, max_depth=4), scale=True),
Regressor('Ada boosting', AdaBoostRegressor(random_state=42), scale=True),
Regressor('Random forest', RandomForestRegressor(random_state=42), scale=True),
Regressor('Gradient boosting', GradientBoostingRegressor(random_state=42), scale=True),
Regressor('KNN', KNeighborsRegressor(n_neighbors=3), scale=True),
Regressor('SVR', SVR(gamma='auto'), scale=True),
]
# Iteration 1 - Use all data
concrete_df=concrete_df_z
result = ModelsBuilder(regressors, concrete_df_z, 'strength').build()
result
# Iteration 2 - Outliers treatment
# Count outliers
q1= concrete_df.quantile(0.25)
q3= concrete_df.quantile(0.75)
IQR = q3-q1
outliers = pd.DataFrame(((concrete_df > (q3+1.5*IQR)) | (concrete_df < (q1-IQR*1.5))).sum(axis=0), columns=['No. of outliers'])
outliers['Percentage of outliers'] = round(outliers['No. of outliers']*100/len(concrete_df), 2)
outliers
concrete_df[['age','superplastic']] = OutliersImputer().fit_transform(concrete_df[['age','superplastic']])
data=concrete_df
result_outliers_treatment = ModelsBuilder(regressors, data, 'strength').build()
result_outliers_treatment
# Iteration 3 - Remove the features identified as least significant earlier
result_feature_engg = ModelsBuilder(regressors, data.drop(['ash', 'coarseagg', 'fineagg'], axis=1), 'strength').build()
result_feature_engg
columns = data.drop(['ash', 'coarseagg', 'fineagg', 'strength'], axis=1).columns
feature_imp = pd.DataFrame(index=columns)
for r in regressors:
    fi = r.feature_imp()
    if fi is not None and len(fi) == len(columns):
        feature_imp[r.get_name()] = fi
plt.figure(figsize=(12, 20))
for i, col in enumerate(feature_imp.columns):
    plt.subplot(4, 2, i + 1)
    ax = sns.barplot(x=feature_imp.index, y=feature_imp[col])
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
plt.subplots_adjust(hspace=0.4, wspace=0.4)
----------------------------- The Random Forest regressor delivers the best performance of the models compared -----------------------------